Using Data Science To Understand Police Brutality¶

By: Sareet Nayak

Motivation¶

black_lives_matter_feature.jpg

Police brutality is a public health crisis in the US. The term describes the unwarranted and excessive force, whether verbal or physical, that police officers (and other law enforcement) use to kill and injure civilians. Black people are affected by police brutality at the highest rate. After the 2014 fatal shooting of Michael Brown at the hands of a police officer, the “Black Lives Matter” movement, a political and social movement seeking accountability and awareness around police brutality and racially motivated acts of violence against black people, grew rapidly. I do not stand for the unjust treatment of anyone, especially not black people in light of police brutality; I have taken part in protests and demonstrations in my own community and will continue to for as long as this problem persists.

While reading further on the “Black Lives Matter” movement, I found that the Washington Post has published data on many fatal shootings in the US between 2015 and 2022. I say many, not all, because many cases of police brutality go undocumented. This data includes the victim’s name, gender, age, race, threat level, and signs of mental illness, in addition to data on the circumstances of the killing: the date, manner of death, whether the victim was armed, fleeing, or threatening, the location of the killing, and whether or not the officer’s body camera was on.

Through my involvement and research, I became interested in understanding the variables affecting police brutality that are not readily known or easily observed. I wanted to approach this using my newfound background and interest in Data Science. Hopefully, these insights can be used by policymakers to overcome biases in policing.

Introduction¶

This semester, I am taking a class called “CMSC 320: Introduction to Data Science” where we had the opportunity to learn how to tell stories with data through the scope of the industry-standard data science pipeline. The data science pipeline includes: 1) data collection, 2) data management and representation, 3) exploratory data analysis, 4) hypothesis testing, and 5) communication of insights attained. Throughout this project, I will walk you through the entire data science pipeline in the context of my work. Enjoy!

Imports¶

In [1]:
#Making imports to the relevant Python libraries

import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt 
from datetime import datetime
import folium
from geopy.geocoders import Nominatim
import requests 
from bs4 import BeautifulSoup
import pgeocode
import plotly.express as px
import math
from collections import defaultdict
import seaborn as sns
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

Data Collection¶

I imported my data into a dataframe called “police_killings”. This process was fairly simple because all of the data was readily accessible from Kaggle (https://www.kaggle.com/datasets/kwullum/fatal-police-shootings-in-the-us).

In [2]:
#Reading information on all of the police killings and storing it in a dataframe called "police_killings"

police_killings = pd.read_csv('PoliceKillingsUS.csv', encoding='cp1252')

Data Management¶

In the data management stage, I did both data transformation and data cleaning. The difference is that data transformation focuses on restructuring the data into a more usable format, while data cleaning focuses on removing unnecessary contents and modifying inaccuracies. In terms of cleaning, I removed the manner of death and age columns, as they were not relevant to my study, and dropped any entries with a missing value in the armed, race, or flee columns (the only columns containing missing values). In terms of transformation, I converted the date column into a proper datetime format ('%m-%d-%Y'), since that is easier to plot; changed gender entries from “M” and “F” to “Male" and "Female" to make the table more readable; changed race entries from “A”, “W”, “H”, “B”, “N”, and “O” to “Asian”, “White”, “Hispanic”, “Black”, “Native”, and “Other” respectively; and renamed the columns to be more specific and readable. Additionally, I added a new column representing whether or not the victim was armed, based on the value of the “Armed With” column.

In [3]:
#Removing columns which are irrelevant to my study 
police_killings = police_killings.drop(['manner_of_death','age'], axis=1)

#Dropping any entries which have missing values in any of the columns: "armed", "race", and "flee"
police_killings = police_killings.dropna(subset=['armed', 'race', 'flee'])

#Iterating through the dataframe "police_killings" and changing the "date" column to be in a proper datetime format
fixed_dates = []
for index, row in police_killings.iterrows():
    date_str = ""
    date_str += row['date'].split('/')[1]
    date_str += '-'
    date_str += row['date'].split('/')[0]
    date_str += '-20'
    date_str += row['date'].split('/')[2]
    date_object = datetime.strptime(date_str, '%m-%d-%Y').date()
    fixed_dates.append(date_object)
police_killings['date'] =  fixed_dates

#Iterating through the dataframe "police_killings" and changing any entries whose gender column is "M" to "Male" and everything else to "Female"
fixed_gender = []
for index, row in police_killings.iterrows():
    if (row['gender'] == 'M'):
        fixed_gender.append('Male')
    else:
        fixed_gender.append('Female')
police_killings['gender'] =  fixed_gender

#Iterating through the dataframe "police_killings" and changing any entries whose race is "A" to "Asian", and so on for the other race codes
fixed_race = []
for index, row in police_killings.iterrows():
    if (row['race'] == 'A'):
        fixed_race.append('Asian')
    elif (row['race'] == 'W'):
        fixed_race.append('White')
    elif (row['race'] == 'H'):
        fixed_race.append('Hispanic')
    elif (row['race'] == 'B'):
        fixed_race.append('Black')
    elif (row['race'] == 'N'):
        fixed_race.append('Native')
    else:
        fixed_race.append('Other')
police_killings['race'] =  fixed_race

#Resetting the column names to make them more intuitive and reasonable 
police_killings.columns = ['ID', 'Name', 'Date', 'Armed With', 'Gender', 'Race', 'City', 'State', 'Has Signs of Mental Illness','Threat Level', 'Fleeing', 'Has Body Camera']
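As an aside, the row-by-row loops above can also be written as vectorized pandas operations, which are typically shorter and faster. Below is a sketch of the same transformations on a hypothetical miniature of the raw data (assuming raw dates like "02/01/15", i.e. day/month/two-digit-year); this is an alternative, not the code the notebook actually ran.

```python
import pandas as pd

# Hypothetical miniature of the raw data, for illustration only
raw = pd.DataFrame({"date": ["02/01/15", "04/01/15"],
                    "gender": ["M", "F"],
                    "race": ["A", "O"]})

# Vectorized equivalents of the row-by-row loops above
raw["date"] = pd.to_datetime(raw["date"], format="%d/%m/%y")
raw["gender"] = raw["gender"].map({"M": "Male", "F": "Female"})
# Codes missing from the mapping fall through to "Other", matching the else branch
raw["race"] = raw["race"].map({"A": "Asian", "W": "White", "H": "Hispanic",
                               "B": "Black", "N": "Native"}).fillna("Other")
```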

After cleaning, I verified that there were no missing values left. This was good, because I did not have to take extra steps (removal or imputation) to handle missing data. Removal is straightforward, while imputation is the process of replacing missing values with educated guesses such as the mean or median.
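For completeness, here is a minimal sketch of what removal and imputation could look like, using a hypothetical toy frame rather than the actual dataset:

```python
import pandas as pd

# Hypothetical toy frame with gaps, for illustration only
df = pd.DataFrame({"age": [25.0, None, 40.0, 35.0],
                   "race": ["White", "Black", None, "Black"]})

# Removal: drop rows missing a value in a given column
dropped = df.dropna(subset=["race"])

# Imputation: fill numeric gaps with the column mean (median also works),
# and categorical gaps with the most frequent value (the mode)
df["age"] = df["age"].fillna(df["age"].mean())
df["race"] = df["race"].fillna(df["race"].mode()[0])
```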

In [4]:
columns = police_killings.columns

print("Missing Value Count per Column")

#Printing the number of missing values (NaNs) per column
for column in columns:
    print(str(column) + " has " + str(police_killings[column].isnull().sum()) + " missing values.")
Missing Value Count per Column
ID has 0 missing values.
Name has 0 missing values.
Date has 0 missing values.
Armed With has 0 missing values.
Gender has 0 missing values.
Race has 0 missing values.
City has 0 missing values.
State has 0 missing values.
Has Signs of Mental Illness has 0 missing values.
Threat Level has 0 missing values.
Fleeing has 0 missing values.
Has Body Camera has 0 missing values.
In [5]:
#Converting the well formatted "Date" column into a datetime object with the following format
police_killings['Date'] = pd.to_datetime(police_killings['Date'], format="%Y-%m-%d", utc=False)
In [6]:
armed = []

#Creating a new column for whether or not the victim was armed. Iterating through the dataframe "police_killings" and appending "No" when the "Armed With" column says "unarmed" and "Yes" otherwise
for index, row in police_killings.iterrows():
    if (row['Armed With'] == 'unarmed'):
        armed.append('No')
    else:
        armed.append('Yes')

police_killings['Armed'] = armed
In [7]:
#Fixing the formatting of all the items within the "Threat Level" and "Fleeing" columns so that they match title case
police_killings["Threat Level"] = police_killings["Threat Level"].str.title()
police_killings["Fleeing"] = police_killings["Fleeing"].str.title()
In [8]:
#Displaying the "police_killings" dataframe
police_killings
Out[8]:
ID Name Date Armed With Gender Race City State Has Signs of Mental Illness Threat Level Fleeing Has Body Camera Armed
0 3 Tim Elliot 2015-01-02 gun Male Asian Shelton WA True Attack Not Fleeing False Yes
1 4 Lewis Lee Lembke 2015-01-02 gun Male White Aloha OR False Attack Not Fleeing False Yes
2 5 John Paul Quintero 2015-01-03 unarmed Male Hispanic Wichita KS False Other Not Fleeing False No
3 8 Matthew Hoffman 2015-01-04 toy weapon Male White San Francisco CA True Attack Not Fleeing False Yes
4 9 Michael Rodriguez 2015-01-04 nail gun Male Hispanic Evans CO False Attack Not Fleeing False Yes
... ... ... ... ... ... ... ... ... ... ... ... ... ...
2523 2808 Kesharn K. Burney 2017-07-26 vehicle Male Black Dayton OH False Attack Car False Yes
2525 2820 Deltra Henderson 2017-07-27 gun Male Black Homer LA False Attack Car False Yes
2528 2812 Alejandro Alvarado 2017-07-27 knife Male Hispanic Chowchilla CA False Attack Not Fleeing False Yes
2533 2817 Isaiah Tucker 2017-07-31 vehicle Male Black Oshkosh WI False Attack Car True Yes
2534 2815 Dwayne Jeune 2017-07-31 knife Male Black Brooklyn NY True Attack Not Fleeing False Yes

2282 rows × 13 columns

Exploratory Data Analysis¶

Through exploratory data analysis and visualization, I wanted to answer some questions: 1) How does the frequency of police killings change over time? 2) Is there a relationship between demographic data and the frequency of killings per state? 3) How does location affect the frequency of police killings?

In [9]:
#Creating a new column in the dataframe which mimics the date, but only contains the year and month
police_killings['year_month'] = police_killings['Date'].map(lambda dt: dt.strftime('%Y-%m'))
#Grouping the "police_killings" dataframe using the new "year_month" column that was defined
grouped_ym = police_killings.groupby(police_killings['year_month']).size().to_frame("count").reset_index()
#Plotting the grouped counts; size() above counts the number of entries (which in this case is police killings) per year/month
grouped_ym.plot(kind = "line", x = 'year_month', y = "count", figsize=(8, 8))

plt.title('Frequency of Police Killings (per month, 2015-2017)')
plt.xlabel('Date')
plt.ylabel('Frequency of Police Killings')
Out[9]:
Text(0, 0.5, 'Frequency of Police Killings')

There is a decreasing trend in the frequency of killings over time. However, there are some peaks, and I am interested in understanding what is causing those deviations from the trend.

At this point, I import some more demographic data, again from https://www.kaggle.com/datasets/kwullum/fatal-police-shootings-in-the-us. The following code blocks build a new dataframe called “state_info” from the original “police_killings” dataframe. Unlike the original, this new dataframe includes each state’s victim count (across 2015 to 2017) and demographic information such as the percent of males, females, hispanic people, white people, black people, native people, asian people, people of other races, and diploma holders, as well as the poverty rate and more. One column worth highlighting is “Percent Victims of Population”: states with greater populations naturally have higher raw victim counts, so to represent the victim count more fairly I expressed it as a percentage of each state’s population.
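The normalization itself is a one-liner per state. As a sketch, using two of the states from the table below (a populous one and a sparse one):

```python
# Victim counts and populations for two states (from the source data)
victims = {"CA": 374, "WY": 7}
population = {"CA": 39538223, "WY": 576851}

# Raw counts favor populous states, so express victims as a
# percentage of each state's population instead
pct_of_population = {state: victims[state] / population[state] * 100
                     for state in victims}
```

On raw counts California dwarfs Wyoming, but per capita Wyoming's rate is actually higher, which is exactly what the percentage column is meant to surface.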

In [10]:
#Creating a dictionary which maps state names (keys) to their abbreviations (values)
state_abbrev = { "Alabama": "AL", "Alaska": "AK", "Arizona": "AZ", "Arkansas": "AR", "California": "CA", "Colorado": "CO", "Connecticut": "CT", "Delaware": "DE", "Florida": "FL", "Georgia": "GA", "Hawaii": "HI", "Idaho": "ID", "Illinois": "IL", "Indiana": "IN", "Iowa": "IA", "Kansas": "KS", "Kentucky": "KY", "Louisiana": "LA", "Maine": "ME", "Maryland": "MD", "Massachusetts": "MA", "Michigan": "MI", "Minnesota": "MN", "Mississippi": "MS", "Missouri": "MO", "Montana": "MT", "Nebraska": "NE", "Nevada": "NV", "New Hampshire": "NH", "New Jersey": "NJ", "New Mexico": "NM", "New York": "NY", "North Carolina": "NC", "North Dakota": "ND", "Ohio": "OH", "Oklahoma": "OK", "Oregon": "OR", "Pennsylvania": "PA", "Rhode Island": "RI", "South Carolina": "SC", "South Dakota": "SD", "Tennessee": "TN", "Texas": "TX", "Utah": "UT", "Vermont": "VT", "Virginia": "VA", "Washington": "WA", "West Virginia": "WV", "Wisconsin": "WI", "Wyoming": "WY", "District of Columbia": "DC", "American Samoa": "AS", "Guam": "GU", "Northern Mariana Islands": "MP", "Puerto Rico": "PR", "United States Minor Outlying Islands": "UM", "U.S. Virgin Islands": "VI", }
In [11]:
#Reading in a new dataset, which contains the percent of males/females per state, into a new dataframe called "state_gender"
state_gender = pd.read_csv('state_gender.csv')
state_gender = state_gender.iloc[:, :-3]
state_gender = state_gender.iloc[:-1]
states = []

for index, row in state_gender.iterrows():
    states.append(state_abbrev[row['State']])

state_gender['State'] = states
#Converting the entries, which are strings ending in "%", into floats where each percent is a whole number. For example, "48%" becomes 48.0
state_gender['Male'] = state_gender['Male'].str[:-1]
state_gender['Male'] = state_gender['Male'].astype('float64')
state_gender['Female'] = state_gender['Female'].str[:-1]
state_gender['Female'] = state_gender['Female'].astype('float64')
In [12]:
#Reading in a new dataset, which contains the percent of different races per state, into a new dataframe called "state_info"
#This new "state_info" dataframe will be augmented to (or merged) with other demographic information 
state_info = pd.read_csv('state_race.csv')
state_info = state_info.iloc[:-5]

hispanic = []
white = [] 
black = []
native = [] 
asian = [] 
other = []
states = []

#Converting each race's count into a percentage of the state's total population. For example, a fraction of 0.48 of the total becomes 48
for index, row in state_info.iterrows():
    states.append(state_abbrev[row['State']])
    hispanic.append((row['Hispanic'] / row['Total']) * 100)
    white.append((row['White'] / row['Total']) * 100)
    black.append((row['Black'] / row['Total']) * 100)
    native.append((row['Indian'] / row['Total']) * 100)
    asian.append((row['Asian'] / row['Total']) * 100)
    other.append(((row['Other'] + row['Hawaiian'])/ row['Total']) * 100)

state_info.drop('Hawaiian', axis=1, inplace=True)
state_info.drop('Total', axis=1, inplace=True)
state_info.rename({'Indian': 'Native'}, axis=1, inplace=True)
state_info['State'] = states
state_info['White'] = white
state_info['Black'] = black
state_info['Hispanic'] = hispanic
state_info['Asian'] = asian
state_info['Native'] = native
state_info['Other'] = other

#Joining the "state_gender" dataframe into the "state_info" dataframe so now the "state_info" dataframe includes information on gender, in addition, to race
state_info = pd.merge(state_gender, state_info, left_on='State', right_on='State', how='inner')
In [13]:
#Creating a dictionary called "state_killing_counts" where the keys are the states and the values are the number of entries per state in the "police_killings" dataframe. This dictionary has been alphabetically sorted by the key
state_killing_counts = police_killings["State"].value_counts()
state_killing_counts = dict(state_killing_counts)
state_killing_counts = dict(sorted(state_killing_counts.items()))

#Creating a dictionary called "state_population" where the keys are states and the values are the population per state. This data was found online. This dictionary has been alphabetically sorted by the key
state_population = {'CA': 39538223, 'TX': 29145505, 'FL': 21538187, 'NY': 20201249, 'PA': 13002700, 'IL': 12801989, 'OH': 11799448, 'GA': 10711908, 'NC': 10439388, 'MI': 10077331, 'NJ': 9288994, 'VA': 8631393, 'WA': 7705281, 'AZ': 7151502, 'MA': 7029917, 'TN': 6910840, 'IN': 6785528, 'MD': 6177224, 'MO': 6154913, 'WI': 5893718, 'CO': 5773714, 'MN': 5706494, 'SC': 5118425, 'AL': 5024279, 'LA': 4657757, 'KY': 4505836, 'OR': 4237256, 'OK': 3959353, 'CT': 3605944, 'UT': 3205958, 'IA': 3271616, 'NV': 3104614, 'AR': 3011524, 'MS': 2961279, 'KS': 2937880, 'NM': 2117522, 'NE': 1961504, 'ID': 1839106, 'WV': 1793716, 'HI': 1455271, 'NH': 1377529, 'ME': 1362359, 'RI': 1097379, 'MT': 1084225, 'DE': 989948, 'SD': 886667, 'ND': 779094, 'AK': 733391, 'DC': 689545, 'VT': 643077, 'WY': 576851}
state_population = dict(sorted(state_population.items()))

#Creating three different lists based on the keys/values of "state_killing_counts" and "state_population"
states = state_killing_counts.keys()
killings = state_killing_counts.values()
population = state_population.values() 

#Reading in a new dataset, which contains the poverty rate per state, into a new dataframe called "state_poverty"
state_poverty = pd.read_csv('poverty_rates.csv')
state = []

for index, row in state_poverty.iterrows():
    state.append(state_abbrev[row.iloc[0]])

state_poverty['State'] = state
#Creating a dictionary called "state_poverty" where the keys are states and the values are poverty per state. This dictionary has been alphabetically sorted by the key
state_poverty = dict(zip(state_poverty['State'], state_poverty['Poverty Rate']))
state_poverty = dict(sorted(state_poverty.items()))
#Creating a list based on the values of "state_poverty"
poverty = state_poverty.values()

#Reading in a new dataset, which contains the percent of people >25 y/o with a high school diploma, into a new dataframe called "state_diploma"
state_diploma = pd.read_csv('state_diploma.csv')
state = []

for index, row in state_diploma.iterrows():
    state.append(state_abbrev[row.iloc[0]])

state_diploma['State'] = state
state_diploma = state_diploma.drop(columns=['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4', 'Unnamed: 5'])
#Creating a dictionary called "state_diploma" where the keys are states and the values are the percent of people >25 y/o with a high school diploma per state. This dictionary has been alphabetically sorted by the key
state_diploma = dict(zip(state_diploma['State'], state_diploma['Percent Over 25 With a High School Diploma or higher']))

state_diploma = dict(sorted(state_diploma.items()))
state_diploma.pop('PR')
#Creating a list based on the values of "state_diploma"
diploma = state_diploma.values()
In [14]:
killing_col = []
population_col = [] 
proportion_col = []
poverty_col = []
diploma_col = []

#Iterating through the "state_info" dataframe and appending to new lists (which will be added to the dataframe) based on the state, using the different dictionaries that were defined above 
for index, row in state_info.iterrows():
    killing_col.append(state_killing_counts[row['State']])
    population_col.append(state_population[row['State']])
    proportion_col.append((state_killing_counts[row['State']] / state_population[row['State']]) * 100)
    poverty_col.append(state_poverty[row['State']])
    diploma_col.append(state_diploma[row['State']])

state_info['Victim Count'] = killing_col
state_info['Population'] = population_col
state_info['Proportion of Victims'] = proportion_col
state_info['Poverty Rate'] = poverty_col
state_info['Percent of Diploma Holders'] = diploma_col

#Renaming the columns to be more meaningful to someone who is not familiar with the dataframe
state_info = state_info.rename(columns={'Male': 'Percent Male', 'Female': 'Percent Female', 'Hispanic': 'Percent Hispanic','White': 'Percent White', 
                                        'Black': 'Percent Black', 'Native':'Percent Native', 'Asian': 'Percent Asian',
                                       'Other': 'Percent Other Races', 'Proportion of Victims': 'Percent Victims of Population',
                                       'Poverty Rate':'Percent Poverty Rate', 'Percent of Diploma Holders': 'Percent Diploma Holders'})
state_info
Out[14]:
State Percent Male Percent Female Percent Hispanic Percent White Percent Black Percent Native Percent Asian Percent Other Races Victim Count Population Percent Victims of Population Percent Poverty Rate Percent Diploma Holders
0 AL 48.1 51.9 4.351991 2.262064 0.171034 0.070854 0.012098 1.835941 45 5024279 0.000896 15.98 87.93
1 AK 51.2 48.8 7.199419 3.661108 0.159297 0.468799 0.194575 2.715641 14 733391 0.001909 10.34 93.31
2 AZ 49.5 50.5 31.511985 19.637070 0.266683 0.539722 0.085614 10.982896 105 7151502 0.001468 14.12 88.97
3 AR 49.0 51.0 7.624126 3.813574 0.078191 0.107375 0.024603 3.600384 20 3011524 0.000664 16.08 88.67
4 CA 49.7 50.3 39.091445 19.540923 0.275990 0.457238 0.229576 18.587719 374 39538223 0.000946 12.58 84.45
5 CO 50.2 49.8 21.655972 14.016981 0.202325 0.428044 0.061003 6.947619 63 5773714 0.001091 9.78 92.43
6 CT 48.9 51.1 16.445986 8.179470 0.843708 0.117545 0.041310 7.263953 7 3605944 0.000194 9.78 91.11
7 DE 48.1 51.9 9.440114 5.926655 0.482495 0.074818 0.028832 2.927314 8 989948 0.000808 11.44 91.36
8 DC 47.5 52.5 11.108816 4.346885 0.850459 0.162399 0.059119 5.689954 11 689545 0.001595 15.45 92.79
9 FL 48.8 51.2 25.775772 18.232454 0.706761 0.078169 0.052750 6.705637 137 21538187 0.000636 13.34 89.79
10 GA 48.3 51.7 9.632952 5.153539 0.420888 0.166680 0.039271 3.852574 61 10711908 0.000569 14.28 88.97
11 HI 49.1 50.9 10.743525 2.574725 0.111755 0.099150 0.872771 7.085124 11 1455271 0.000756 9.26 92.93
12 ID 49.9 50.1 12.709256 7.036726 0.055747 0.237294 0.051187 5.328304 16 1839106 0.000870 11.94 91.26
13 IL 49.2 50.8 17.227648 8.938293 0.236502 0.162557 0.055913 7.834383 57 12801989 0.000445 11.99 90.17
14 IN 49.3 50.7 7.099934 3.836122 0.143440 0.054757 0.024325 3.041291 40 6785528 0.000589 12.91 90.64
15 IA 49.9 50.1 6.171629 4.078462 0.086444 0.078825 0.013429 1.914470 12 3271616 0.000367 11.11 93.32
16 KS 49.8 50.2 12.071678 7.628083 0.207374 0.154157 0.046659 4.035406 24 2937880 0.000817 11.44 91.89
17 KY 49.1 50.9 3.764025 2.164232 0.116138 0.028889 0.013783 1.440984 41 4505836 0.000910 16.61 87.99
18 LA 48.2 51.8 5.217407 2.921269 0.247866 0.055953 0.020452 1.971866 50 4657757 0.001073 18.65 86.68
19 ME 49.1 50.9 1.726027 1.037943 0.065184 0.065855 0.008428 0.548617 10 1362359 0.000734 11.07 94.53
20 MD 48.2 51.8 10.259301 4.083626 0.487377 0.076802 0.039784 5.571712 36 6177224 0.000583 9.02 91.09
21 MA 48.9 51.1 12.049173 5.772804 0.685014 0.073563 0.046748 5.471044 22 7029917 0.000313 9.85 91.10
22 MI 49.3 50.7 5.225665 3.080839 0.176029 0.071226 0.021135 1.876436 36 10077331 0.000357 13.71 91.96
23 MN 49.9 50.1 5.494034 2.672403 0.094229 0.109800 0.038249 2.579352 31 5706494 0.000543 9.33 94.13
24 MS 48.1 51.9 3.163891 1.623665 0.162853 0.035347 0.006070 1.335956 22 2961279 0.000743 19.58 86.49
25 MO 48.9 51.1 4.289192 2.467326 0.090347 0.059241 0.021293 1.650986 58 6154913 0.000942 13.01 91.59
26 MT 50.7 49.3 3.908901 2.201647 0.093058 0.264009 0.012904 1.337283 11 1084225 0.001015 12.78 94.35
27 NE 49.8 50.2 11.173152 6.971992 0.111860 0.181773 0.021052 3.886474 14 1961504 0.000714 10.37 92.16
28 NV 50.1 49.9 28.901544 13.889438 0.362277 0.388974 0.178531 14.082324 35 3104614 0.001127 12.78 87.16
29 NH 50.0 50.0 3.895387 2.403479 0.199374 0.022284 0.013503 1.256748 7 1377529 0.000508 7.42 94.44
30 NJ 49.0 51.0 20.427604 10.819300 0.771455 0.135030 0.070959 8.630860 30 9288994 0.000323 9.67 90.98
31 NM 49.0 51.0 49.202559 33.319170 0.262325 0.728891 0.112874 14.779299 41 2117522 0.001936 18.55 87.48
32 NY 48.7 51.3 19.066030 7.141987 1.357582 0.174769 0.086765 10.304927 43 20201249 0.000213 13.58 88.03
33 NC 48.3 51.7 9.541973 4.969572 0.336012 0.116009 0.027132 4.093248 67 10439388 0.000642 13.98 89.70
34 ND 51.3 48.7 3.988064 1.997123 0.065755 0.262364 0.019464 1.643359 4 779094 0.000513 10.53 93.62
35 OH 49.0 51.0 3.939428 2.171324 0.173383 0.047459 0.016496 1.530765 72 11799448 0.000610 13.62 91.74
36 OK 49.5 50.5 10.925035 6.245952 0.144885 0.379557 0.029752 4.124890 67 3959353 0.001692 15.27 88.71
37 OR 49.7 50.3 13.223976 7.676256 0.088570 0.235637 0.061776 5.161737 32 4237256 0.000755 12.36 91.87
38 PA 49.2 50.8 7.595324 3.672452 0.525624 0.067011 0.028097 3.302140 45 13002700 0.000346 11.95 91.89
39 RI 48.9 51.1 15.882711 7.553427 0.994141 0.142938 0.064284 7.127920 2 1097379 0.000182 11.58 89.14
40 SC 48.2 51.8 5.831209 3.068201 0.185289 0.056290 0.015182 2.506247 42 5118425 0.000821 14.68 89.61
41 SD 50.7 49.3 4.104006 2.263071 0.045489 0.322175 0.015694 1.457577 9 886667 0.001015 12.81 93.05
42 TN 48.7 51.3 5.569213 3.356468 0.150703 0.049673 0.016080 1.996288 56 6910840 0.000810 14.62 89.74
43 TX 49.5 50.5 39.441532 27.780769 0.338654 0.254192 0.065342 11.002575 203 29145505 0.000697 14.22 85.39
44 UT 50.5 49.5 14.155289 7.225380 0.097644 0.174154 0.035827 6.622284 22 3205958 0.000686 9.13 93.17
45 VT 49.6 50.4 2.004997 1.242752 0.074959 0.053657 0.013454 0.620175 3 643077 0.000467 10.78 94.55
46 VA 48.8 51.2 9.527981 5.101442 0.348734 0.066950 0.066891 3.943964 43 8631393 0.000498 10.01 91.38
47 WA 49.9 50.1 12.932133 6.069539 0.140100 0.214164 0.084180 6.424150 52 7705281 0.000675 10.19 92.35
48 WV 49.4 50.6 1.586732 0.972322 0.063737 0.004094 0.015492 0.531087 22 1793716 0.001227 17.10 88.82
49 WI 49.9 50.1 7.030631 3.685103 0.126796 0.083417 0.023868 3.111448 42 5893718 0.000713 10.97 93.33
50 WY 51.1 48.9 10.123712 6.782512 0.063129 0.261633 0.025114 2.991324 7 576851 0.001213 10.76 93.59
In [15]:
#Creating a heat map of the "state_info" which contains information on the correlations between the different columns
fig, ax = plt.subplots(figsize=(10,10))
sns.heatmap(state_info.corr(), annot = True, linewidths=.5, ax=ax)
Out[15]:
<AxesSubplot:>

We can see there is a fairly strong linear relationship between the percent of victims of the entire population and percent male, percent hispanic, percent white, percent native, and the poverty rate. Correlation is only a measure of linear relationships, and it is possible that other variables have a nonlinear relationship with the victim rate. That said, a nonlinear model should only be adopted when a statistical test suggests the relationship actually is nonlinear.
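To illustrate the limitation, Pearson correlation (what pandas' `corr()` computes by default) understates a perfectly monotonic but nonlinear relationship, while a rank-based measure like Spearman's correlation captures it. A quick sketch on synthetic data (not the actual "state_info" frame):

```python
import numpy as np
import pandas as pd

# Synthetic monotonic-but-nonlinear relationship, for illustration only
x = np.linspace(1, 10, 50)
df = pd.DataFrame({"x": x, "y": np.exp(x)})

pearson = df.corr(method="pearson").loc["x", "y"]
spearman = df.corr(method="spearman").loc["x", "y"]
# Spearman scores the perfect monotonic relationship as 1.0,
# while Pearson is noticeably lower because the relationship is not linear
```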

In [16]:
#The beginning of this code is the same as before. Python was having issues referencing these variables, so they were redefined
race_killings_counts = police_killings["Race"].value_counts()
race_killings_counts = dict(race_killings_counts)

race_population = {'White': (0.593 * 331893745), 'Black': (0.136 * 331893745), 'Hispanic': (0.189 * 331893745), 'Asian': (0.061 * 331893745), 'Native': (0.013 * 331893745), 'Other': 2655149.96}

races = race_killings_counts.keys()
killings = race_killings_counts.values()
population = race_population.values()

#Creating a plot for the number of police killings per race and the total population per race
fig, axs = plt.subplots(2, figsize=(10, 10))
axs[0].bar(races, killings)
axs[0].set_title("Number of Victims of Fatal Police Brutality, per Race")
axs[0].set(xlabel='Race', ylabel='Number of Victims')
axs[1].bar(races, population)
axs[1].set_title("Population, per Race")
axs[1].set(xlabel='Race', ylabel='Population')
plt.show()
In [17]:
#Creating a new dictionary which uses the "race_killings_counts" and "race_population" defined earlier. It makes the keys the same as before, but the values the proportion of the associated values in "race_killings_counts" and "race_population"
race_killing_proportion = {}
for key in race_killings_counts:
    race_killing_proportion[key] = race_killings_counts[key] / race_population[key]

race = race_killing_proportion.keys()
proportions = race_killing_proportion.values()

#Creating a plot for the proportion of race population that were victims of fatal police brutality
plt.figure(figsize=(8, 8))
plt.bar(race, proportions)
plt.title("Proportion of Race Victims of Fatal Police Brutality")
plt.xlabel("Race")
plt.ylabel("Proportion of Race")
plt.show()

White people are the most common victims of police brutality as measured by raw counts. However, this may simply be because white people make up the largest share of the population. To account for this, I plotted the proportion of each race's population who were victims of police brutality and found that the proportion for black people was far greater than for the other races, which suggests that police might have a racial bias.

In [18]:
#Plotting the number of entries in "police_killings" that were female/male using value_counts
gender_counts = police_killings["Gender"].value_counts()
plt.figure(figsize=(8, 8))
gender_counts.plot.pie(autopct="%.2f%%")
plt.title('Gender Distribution of Victims of Fatal Police Brutality')                                         
plt.show()                                                                                                   

Males are the most common victims of fatal police brutality; ~96% of the victims in my dataset were men. This suggests bias and stereotypes against men, especially men of color, a phenomenon known as gender-based policing.

In [19]:
#Plotting the number of entries in "police_killings" that displayed signs of mental illness using value_counts
mental_illness = police_killings["Has Signs of Mental Illness"].value_counts()
plt.figure(figsize=(8, 8))
mental_illness.plot.pie(autopct="%.2f%%")
plt.title('Signs of Mental Illness Distribution of Victims of Fatal Police Brutality')                                         
plt.show()                                                                                                   

~25% of victims of fatal police brutality displayed signs of mental illness. This could suggest that police officers and law enforcement are not well trained in recognizing and responding to people with mental illness. Additionally, the stigma and discrimination against people with mental illness might have played a part.

In [20]:
#Plotting the number of entries in "police_killings" where the police officer didn't have their body camera on using value_counts
body_cam = police_killings["Has Body Camera"].value_counts()
plt.figure(figsize=(8, 8))
body_cam.plot.pie(autopct="%.2f%%")
plt.title('Body Camera On Distribution')                                         
plt.show()                                                                                                   

In ~89% of the fatal incidents in my dataset, the officer's body camera was off. Police are supposed to wear body cameras to promote accountability and professionalism; yet in many cases of fatal police brutality the camera is off, eliminating evidence and thus accountability.

In [21]:
#The beginning of this code is the same as before. Python was having issues referencing these variables, so they were redefined
states = state_killing_counts.keys()
killings = state_killing_counts.values()
population = state_population.values()

#Creating a plot for the number of police killings per state and the total population per state
fig, axs = plt.subplots(2, figsize=(15, 15))
axs[0].bar(states, killings)
axs[0].set_title("Number of Victims of Fatal Police Brutality, per State")
axs[0].set(xlabel='State', ylabel='Number of Victims')
axs[1].bar(states, population)
axs[1].set_title("Population, per State")
axs[1].set(xlabel='State', ylabel='Population')
Out[21]:
[Text(0.5, 0, 'State'), Text(0, 0.5, 'Population')]

States such as California and Texas, which had the greatest populations, also had the greatest numbers of victims, which makes sense.

In [22]:
#The beginning of this code is the same as before. Python was having issues referencing it, so it was redefined
poverty = state_poverty.values()

#Creating a plot for the number of police killings per state and the poverty rate per state
fig, axs = plt.subplots(2, figsize=(15, 15))
axs[0].bar(states, killings)
axs[0].set_title("Number of Victims of Fatal Police Brutality, per State")
axs[0].set(xlabel='State', ylabel='Number of Victims')
axs[1].bar(states, poverty)
axs[1].set_title("Poverty Rate, per State")
axs[1].set(xlabel='State', ylabel='Poverty Rate (%)')
Out[22]:
[Text(0.5, 0, 'State'), Text(0, 0.5, 'Poverty Rate (%)')]

There is no significant or visible relationship between poverty rates and the number of victims of fatal police brutality: the states with the highest and lowest victim counts showed no corresponding trend in their poverty rates.

In [23]:
#The beginning of this code is the same as before. Python was having issues referencing it, so it was redefined
diploma = state_diploma.values()

#Creating a plot for the number of police killings per state and the percent of people >25 y/o with a diploma per state
fig, axs = plt.subplots(2, figsize=(15, 15))
axs[0].bar(states, killings)
axs[0].set_title("Number of Victims of Fatal Police Brutality, per State")
axs[0].set(xlabel='State', ylabel='Number of Victims')
axs[1].bar(states, diploma)
axs[1].set_title("Percent of People >25 y/o with Diploma, per State")
axs[1].set(xlabel='State', ylabel='Percent of People >25 y/o with Diploma  (%)')
axs[1].set_ylim([80, 100])
Out[23]:
(80.0, 100.0)

States such as California and Texas, which have among the lowest percentages of people over 25 with a diploma, also had the greatest numbers of victims. This is possibly because it is easier for people with more advanced levels of education to advocate for themselves and seek justice.
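To move beyond eyeballing the two bar charts, the inverse relationship could be quantified with a Pearson correlation coefficient. A hedged sketch using made-up numbers (not the real state-level data) and numpy's `corrcoef`, which the notebook does not otherwise use:

```python
import numpy as np

# Hypothetical sketch: quantifying the claimed inverse relationship
# between diploma rates and victim counts. Toy values chosen to show
# a strong negative trend, NOT the real state data.
diploma_rates = np.array([82, 85, 88, 91, 94])   # % of people >25 with a diploma
victim_counts = np.array([900, 650, 480, 300, 120])

# corrcoef returns the full correlation matrix; [0, 1] is r(x, y)
r = np.corrcoef(diploma_rates, victim_counts)[0, 1]
print(round(r, 3))  # strongly negative, consistent with the visual impression
```

An r near -1 would support the visual claim; on the real data the correlation would of course be weaker and would deserve a significance test as well.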

In [24]:
state = state_killing_counts.keys()
num_entries = state_killing_counts.values()

#Turning the state_killing_counts dictionary into a dataframe, which works well with Plotly's choropleth
state_occurrence = pd.DataFrame(list(zip(state, num_entries)), columns=['State', 'Occurrences'])

#Plotting the number of victims of fatal police brutality in each state using Plotly Express's choropleth
fig = px.choropleth(state_occurrence, locations='State', locationmode="USA-states", scope="usa",
                    color='Occurrences', color_continuous_scale="Viridis_r")
fig.show()

States in the South and Southwest, such as California, Arizona, Texas, and Florida, have the highest numbers of victims of fatal police brutality. This might have to do with these regions' policies, as well as biases and stereotypes shared by people in these regions.

Linear Regression and Hypothesis Testing¶

I wanted to better understand how different demographic factors (percent male, percent female, percent hispanic, percent white, percent black, percent native, percent asian, percent other races, percent poverty rate, and percent diploma holders) can be used to predict the severity of police brutality, measured as victims as a percentage of total state population. I therefore made the former the features of the model and the latter its target. I could have studied feature importance to decide which features to fit the model on; however, I didn't want to lose any important information. A regression felt fitting here because I wanted to use the features to predict a continuous variable, the severity of police brutality. I used hold-out validation: the training set (75% of the data) was used to train the model and the testing set (25% of the data) was used to evaluate it. I personally wish I had a larger dataset, because a dataset with one row per state is very limiting.

In [25]:
#Defining the feature variables which will be used to predict the target variable 
X = state_info.drop(columns=['State', 'Victim Count', 'Population', 'Percent Victims of Population'])
#Defining the target variable
y = state_info['Percent Victims of Population']

#Creating the testing and training sets, such that the testing set is 25% of the entire set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

#Running an OLS regression 
model = sm.OLS(y_train, X_train)
results = model.fit()
y_pred = results.predict(X_test)
In [26]:
#Printing out the summary statistics of said OLS regression 
print(results.summary())
                                  OLS Regression Results                                 
=========================================================================================
Dep. Variable:     Percent Victims of Population   R-squared:                       0.755
Model:                                       OLS   Adj. R-squared:                  0.687
Method:                            Least Squares   F-statistic:                     11.17
Date:                           Fri, 16 Dec 2022   Prob (F-statistic):           4.76e-07
Time:                                   10:24:40   Log-Likelihood:                 272.29
No. Observations:                             38   AIC:                            -526.6
Df Residuals:                                 29   BIC:                            -511.8
Df Model:                                      8                                         
Covariance Type:                       nonrobust                                         
===========================================================================================
                              coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------------
Percent Male                0.0001   7.44e-05      1.511      0.142   -3.98e-05       0.000
Percent Female              0.0001   5.41e-05      2.192      0.037    7.93e-06       0.000
Percent Hispanic            0.0004      0.000      4.169      0.000       0.000       0.001
Percent White              -0.0004      0.000     -3.819      0.001      -0.001      -0.000
Percent Black              -0.0009      0.000     -4.817      0.000      -0.001      -0.001
Percent Native              0.0019      0.000      4.813      0.000       0.001       0.003
Percent Asian               0.0004      0.000      1.360      0.184      -0.000       0.001
Percent Other Races        -0.0005      0.000     -4.529      0.000      -0.001      -0.000
Percent Poverty Rate    -2.808e-05   3.93e-05     -0.714      0.481      -0.000    5.23e-05
Percent Diploma Holders    -0.0001   5.17e-05     -2.208      0.035      -0.000   -8.42e-06
==============================================================================
Omnibus:                        0.784   Durbin-Watson:                   1.541
Prob(Omnibus):                  0.676   Jarque-Bera (JB):                0.481
Skew:                          -0.275   Prob(JB):                        0.786
Kurtosis:                       2.965   Cond. No.                     1.93e+17
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 1.4e-29. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.

There are some important insights in the OLS regression's summary: 1) The R-squared value is 0.755, which means that about 75.5% of the variation in the target variable can be explained by the feature variables. 2) The p-value of the F-statistic is 4.76e-07, which is less than the alpha of 0.05, meaning there is a statistically significant relationship between the feature variables and the target variable. 3) Some features are more significant than others, including percent female, percent hispanic, percent white, percent black, percent native, percent other races, and percent diploma holders, because those individual variables have p-values less than alpha. To reiterate, the feature variables of this model are percent male, percent female, percent hispanic, percent white, percent black, percent native, percent asian, percent other races, percent poverty rate, and percent diploma holders, and the target variable is percent victims of population. One caveat: the summary's second note warns of strong multicollinearity (for example, percent male and percent female sum to 100), so the individual coefficients should be interpreted with caution.

Reflection¶

Throughout this project, I had the pleasure of driving the data science pipeline from start to end. Quite frankly, as this project approached, I was very nervous. Most of my Computer Science classes do not have capstones or cumulative projects, so I did not know what this would entail. Additionally, our smaller projects were very guided and mimicked only small portions of the data science pipeline, so I didn't know if I was ready for a project of this scale. However, I learned a lot and had a lot of fun, and altogether this project gave me the confidence to pursue similar personal projects on my own time. Being in CMSC320, with the support of Max, the TAs, and my peers, helped me build this confidence, so I strongly recommend this class to any reader interested in taking it. Some key takeaways from this project were how heavily correlated race, gender, mental illness, location, and more are with police brutality, relationships I was able to model using an OLS linear regression. All of this can be studied more thoroughly in my EDA section.

One personal challenge I had during this process was actually selecting a dataset; there is a great breadth of data out there, and it was difficult to know where to start. However, Kaggle did provide me a great dataset to start with, although I did do some additional web scraping for data. If anyone reading this is interested in getting into Data Science and Data Analytics, a good place to start is definitely Kaggle!

Altogether, thank you to Max and the awesome TAs for making this project and CMSC320 an awesome learning experience!

Sources¶

Dataset: https://www.kaggle.com/datasets/kwullum/fatal-police-shootings-in-the-us